Overview
Background
This document sets out a few practical recipes for modeling with insurance data, covering
- Common data transforms, summary stats, and simple visualisations
- Regression
- Grouped vs ungrouped data
- Choice of: response distribution, link (and offsets), explanatory variables
- Modeling variance to industry/ reference (A/E or A - E)
- Model selection - stepwise regression, likelihood tests
- Predictions, confidence intervals and visualisations
Libraries
A list of packages used in the recipe book.
library(rmdformats) # theme for the HTML doc
library(bookdown) # bibliography formatting
library(kableExtra) # formatting tables
library(scales) # data formatting
library(dplyr) # tidyverse: data manipulation
library(tidyr) # tidyverse: tidy messy data
library(corrplot) # correlation matrix visualisation, optional
library(ggplot2) # graphs
library(GGally) # extension of ggplot2
library(ggthemes) # additional themes for ggplot, optional
library(plotly) # graphs, including 3D
library(MASS) # statistics
library(pROC) # visualising ROC curves
Data Simulation
This section sets out a method for generating dummy data. The simulated data is intended to reflect typical data used in an analysis of disability income incidence experience and is used throughout this analysis. Replace this data with your actual data.
Simulating policies
Here we are using simulated data to fit a model; replace it with your own data. We start by simulating a mix of 200k policies over 3 years, with some simplifying assumptions (e.g. no lapses or new business, no indexation). Mix of business assumptions for benefit period, waiting period and occupation are taken from (James Louw 2012), with the remainder based on an anecdotal view of industry mix not intended to be reflective of any one business.
# set the seed value (for the random number generator) so that the simulated data frame can be replicated later
set.seed(10)
# create 200k policies
n <- 200000
# data frame columns
# policy_year skewed to early years, but tail is fat
df <- data.frame(id = c(1:n), cal_year = 2018,policy_year = round(rweibull(n, scale=5, shape=2),0))
df <- df %>% mutate(sex = replicate(n,sample(c("m","f","u"), size=1, replace=TRUE, prob=c(.75,.20,.05))),
smoker = replicate(n,sample(c("n","s","u"), size=1, replace=TRUE, prob=c(.85,.1,.05))),
# mix of business for benefit_period, waiting_period, occupation taken from industry presentation
benefit_period = replicate(n,sample(c("a65","2yr","5yr"), size=1, replace=TRUE, prob=c(.76,.12,.12))),
waiting_period = replicate(n,sample(c("14d","30d","90d","720d"), size=1, replace=TRUE, prob=c(.04,.7,.15,.11))),
occupation = replicate(n,sample(c("prof","sed","techn","blue","white"), size=1, replace=TRUE, prob=c(.4,.2,.2,.1,.1))),
# age and policy year correlated; age normally distributed around 40 + policy_year (where policy_year is distributed around 5 years), floored at 25, capped at 60
age = round(pmax(pmin(rnorm(n,mean = 40+policy_year, sd = 5),60),25),0),
# sum_assured, age and occupation are correlated; sum assured normally distributed around some mean (dependent on age rounded to 10 and on occupation), floored at 500
sum_assured =
round(
pmax(
rnorm(n,mean = (round(age,-1)*100+ 1000) *
case_when(occupation %in% c("white","prof") ~ 1.5, occupation %in% c("sed") ~ 1.3 , TRUE ~ 1),
sd = 2000),500),
0)
)
# generate 3 years of exposure for the 200k policies => assume no lapses or new business
df2 <- df %>% mutate(cal_year=cal_year+1,policy_year=policy_year+1,age=age+1)
df3 <- df2 %>% mutate(cal_year=cal_year+1,policy_year=policy_year+1,age=age+1)
df <- rbind(df,df2,df3)
Expected claim rate
Set p values from which to simulate claims. The crude p values below were derived from the Society of Actuaries Analysis of USA Individual Disability Claim Incidence Experience from 2006 to 2014 (SOA 2019), with some allowance for Australian industry differentials (Ian Welch 2020).
# by cause, age and sex, based upon polynomials fitted to crude actual rates
# sickness
f_sick_age_m <- function(age) {-0.0000003*age^3 + 0.000047*age^2 - 0.00203*age + 0.02715}
f_sick_age_f <- function(age) {-0.0000002*age^3 + 0.000026*age^2 - 0.00107*age + 0.01550}
f_sick_age_u <- function(age) {f_sick_age_f(age)*1.2}
f_sick_age <- function(age,sex) {case_when(sex == "m" ~ f_sick_age_m(age), sex == "f" ~ f_sick_age_f(age), sex == "u" ~ f_sick_age_u(age))}
# accident
f_acc_age_m <- function(age) {-0.00000002*age^3 + 0.000004*age^2 - 0.00020*age + 0.00340}
f_acc_age_f <- function(age) {-0.00000004*age^3 + 0.000007*age^2 - 0.00027*age + 0.00374}
f_acc_age_u <- function(age) {f_acc_age_f(age)*1.2}
f_acc_age <- function(age,sex) {case_when(sex == "m" ~ f_acc_age_m(age), sex == "f" ~ f_acc_age_f(age), sex == "u" ~ f_acc_age_u(age))}
# smoker, wp and occ based upon ratio of crude actual rates by category
# occupation adjustment informed by FSC commentary on DI incidence experience
f_smoker <- function(smoker) {case_when(smoker == "n" ~ 1, smoker == "s" ~ 1.45, smoker == "u" ~ 0.9)}
f_wp <- function(waiting_period) {case_when(waiting_period == "14d" ~ 1.4, waiting_period == "30d" ~ 1, waiting_period == "90d" ~ 0.3, waiting_period == "720d" ~ 0.2)}
f_occ_sick <- function(occupation) {case_when(occupation == "prof" ~ 1, occupation == "sed" ~ 1, occupation == "techn" ~ 1, occupation == "blue" ~ 1, occupation == "white" ~ 1)}
f_occ_acc <- function(occupation) {case_when(occupation == "prof" ~ 1, occupation == "sed" ~ 1, occupation == "techn" ~ 4.5, occupation == "blue" ~ 4.5, occupation == "white" ~ 1)}
# anecdotal allowance for higher rates at larger policy size and for older policies
f_sa_sick <- function(sum_assured) {case_when(sum_assured<=6000 ~ 1, sum_assured>6000 & sum_assured<=10000 ~ 1.1, sum_assured>10000 ~ 1.3)}
f_sa_acc <- function(sum_assured) {case_when(sum_assured<=6000 ~ 1, sum_assured>6000 & sum_assured<=10000 ~ 1, sum_assured>10000 ~ 1)}
f_pol_yr_sick <- function(policy_year) {case_when(policy_year<=5 ~ 1, policy_year>5 & policy_year<=10 ~ 1.1, policy_year>10 ~ 1.3)}
f_pol_yr_acc <- function(policy_year) {case_when(policy_year<=5 ~ 1, policy_year>5 & policy_year<=10 ~ 1, policy_year>10 ~ 1)}
Expected claims
Add the crude p values to the data and simulate one draw from a binomial with prob = p for each record. This gives a claim/no-claim indicator for each policy. Simplifying assumptions include independence of draws across years for each policy and independence of accident and sickness incidence.
# add crude expected
df$inc_sick_expected=f_sick_age(df$age,df$sex)*f_smoker(df$smoker)*f_wp(df$waiting_period)*f_occ_sick(df$occupation)*f_sa_sick(df$sum_assured)*f_pol_yr_sick(df$policy_year)
df$inc_acc_expected=f_acc_age(df$age,df$sex)*f_smoker(df$smoker)*f_wp(df$waiting_period)*f_occ_acc(df$occupation)*f_sa_acc(df$sum_assured)*f_pol_yr_acc(df$policy_year)
# add prediction
df$inc_count_sick = sapply(df$inc_sick_expected,function(z){rbinom(1,1,z)})
df$inc_count_acc = sapply(df$inc_acc_expected,function(z){rbinom(1,1,z)})*(1-df$inc_count_sick)
df$inc_count_tot = df$inc_count_sick + df$inc_count_acc
Data Exploration
The sections below rely heavily upon the dplyr package.
Data structure
Looking at the metadata for the data frame and a sample of the contents.
# glimpse() or str() returns detail on the structure of the data frame. Our data consists of 600k rows and 15 columns. The columns are policy ID, several explanatory variables like sex and smoker, expected counts of claim (inc_sick_expected and inc_acc_expected) and actual counts of claim (inc_count_sick/acc/tot).
glimpse(df)
## Rows: 600,000
## Columns: 15
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1~
## $ cal_year <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018~
## $ policy_year <dbl> 4, 5, 5, 3, 8, 6, 6, 6, 3, 5, 3, 4, 7, 4, 5, 5, 9, 6~
## $ sex <chr> "m", "f", "m", "m", "f", "f", "m", "m", "m", "m", "m~
## $ smoker <chr> "n", "n", "n", "n", "n", "n", "n", "n", "s", "n", "n~
## $ benefit_period <chr> "a65", "5yr", "a65", "a65", "a65", "2yr", "a65", "a6~
## $ waiting_period <chr> "90d", "30d", "30d", "30d", "30d", "14d", "720d", "3~
## $ occupation <chr> "techn", "blue", "blue", "prof", "sed", "prof", "tec~
## $ age <dbl> 41, 46, 41, 27, 53, 49, 54, 42, 43, 52, 36, 47, 47, ~
## $ sum_assured <dbl> 7119, 5582, 6113, 5147, 8864, 11209, 6378, 6780, 597~
## $ inc_sick_expected <dbl> 0.000742731, 0.001828800, 0.002475770, 0.000698100, ~
## $ inc_acc_expected <dbl> 0.0007365330, 0.0100735200, 0.0024551100, 0.00052234~
## $ inc_count_sick <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ inc_count_acc <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ inc_count_tot <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
# head() returns the first 6 rows of the data frame. Similar to head(), sample_n() returns rows from our data frame, however these are chosen randomly. e.g. sample_n(df,5,replace=FALSE)
head(df) %>% kbl(caption = "Sample of data")
| id | cal_year | policy_year | sex | smoker | benefit_period | waiting_period | occupation | age | sum_assured | inc_sick_expected | inc_acc_expected | inc_count_sick | inc_count_acc | inc_count_tot |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2018 | 4 | m | n | a65 | 90d | techn | 41 | 7119 | 0.0007427 | 0.0007365 | 0 | 0 | 0 |
| 2 | 2018 | 5 | f | n | 5yr | 30d | blue | 46 | 5582 | 0.0018288 | 0.0100735 | 0 | 0 | 0 |
| 3 | 2018 | 5 | m | n | a65 | 30d | blue | 41 | 6113 | 0.0024758 | 0.0024551 | 0 | 0 | 0 |
| 4 | 2018 | 3 | m | n | a65 | 30d | prof | 27 | 5147 | 0.0006981 | 0.0005223 | 0 | 0 | 0 |
| 5 | 2018 | 8 | f | n | a65 | 30d | sed | 53 | 8864 | 0.0024788 | 0.0031379 | 0 | 0 | 0 |
| 6 | 2018 | 6 | f | n | 2yr | 14d | prof | 49 | 11209 | 0.0039363 | 0.0036555 | 0 | 0 | 0 |
# class() returns the class of a column.
class(df$benefit_period)
## [1] "character"
Factors
From the above you’ll note that the categorical columns are stored as characters. Converting these to factors makes them easier to work with in our models, e.g. for benefit period the levels a65|2yr|5yr are coded as 1|2|3. Factors are stored as integers with labels describing each level; they can be ordered, and they are useful for statistical analysis.
# table() returns a table of counts at each combination of column values. prop.table() converts these to a proportion. For example, applying this to the column "sex" shows us that ~75% of our data is "m" and that the other data are either "f" or "u" (unknown).
table(df$sex)
##
## f m u
## 120951 449487 29562
prop.table(table(df$sex))
##
## f m u
## 0.201585 0.749145 0.049270
# we can then convert the columns to factors based upon the values of the column and ordering by frequency.
df$sex <- factor(df$sex, levels = c("m","f","u"))
df$smoker <- factor(df$smoker, levels = c("n","s","u"))
df$benefit_period <- factor(df$benefit_period, levels = c("a65","2yr","5yr"))
# note: use levels= to set the order; labels= would silently rename the underlying values
df$waiting_period <- factor(df$waiting_period, levels = c("30d","14d","720d","90d"))
df$occupation <- factor(df$occupation, levels = c("prof","sed","techn","white","blue"))
# if the column is already a factor, you can extract the levels to show what order they will be used in our models
levels(df$sex)
## [1] "m" "f" "u"
Selection methods
table() is a method of summarizing data, returning a count at each combination of values in a column. sample() and sample_n() are other examples of selection methods. This section (not exhaustive) looks at a few more selection methods in base R and dplyr.
# data subsets: e.g. select from df where age <25 or >60
subset(df, age <25 | age > 60)
# dropping columns:
# exclude columns
mycols <- names(df) %in% c("cal_year", "smoker")
new_df <- df[!mycols]
# exclude 3rd and 5th column
new_df <- df[c(-3,-5)]
# delete columns v3 and v5 from new_df by assigning NULL
new_df$v3 <- new_df$v5 <- NULL
# keeping columns:
# select the named columns cal_year and smoker
mycols <- names(df) %in% c("cal_year", "smoker")
new_df <- df[mycols]
# select 1st and 5th to 7th columns
new_df <- df[c(1,5:7)]
Manipulation methods
We might want to modify our data frame to prepare it for fitting our models. The section below looks at a few simple data manipulations. Here we also introduce the pipe (infix) operator (%>%): it passes the object on its left as the first argument to the function on its right, so df %>% operation passes the data frame df to that operation.
# create a copy of the dataframe to work from
new_df <- df
# simple manipulations
# select as in the selection methods section, but using infix
new_df %>% select(id, age) # or a range using select(1:5) or select(contains("sick")) or select(starts_with("inc")); others e.g. ends_with(), last_col(), select(-age)
# replace values in a column
replace(new_df$sex,new_df$sex=="u","m") # no infix in base r
# Rename, id to pol_id
new_df %>% rename(pol_id = id) #or (reversing the renaming)
new_df %>% select(pol_id = id)
# alter data
new_df <- new_df %>% mutate(inc_tot_expected = inc_acc_expected + inc_sick_expected) # need to assign the output back to the data frame
# transmute - select and mutate simultaneously
new_df2 <- new_df %>% transmute(id, age, birth_year = cal_year - age)
# sort
new_df %>% arrange(desc(age))
# filter
new_df %>% filter(benefit_period == "a65", age <65) # or
new_df %>% filter(benefit_period %in% c("a65","5yr"))
# aggregations
# group by, also ungroup()
new_df %>% group_by(sex) %>% # can add a mutate to group by which will aggregate only to the level specified in the group_by e.g.
mutate(sa_by_sex = sum(sum_assured)) # adds a new column with the total sum assured by sex.
# after doing this, ungroup() in order to apply future operations to all records individually
# count, sorting by most frequent and weighting by another column
new_df %>% count(sex, wt= sum_assured, sort=TRUE) # counts the number of entries for each value of sex, weighted by sum assured
# summarize takes many observations and turns them into one observation. mean(), median(), min(), max(), and n() for the size of the group
new_df %>% summarize(total = sum(sum_assured), min_age = min(age), max_age = max(age), max(inc_tot_expected))
new_df %>% group_by(sex) %>% summarise(n = n())
table(new_df$sex) # returns count by sex; no infix in base r
# outliers
new_df %>% top_n(10, inc_tot_expected) # also operates on grouped table - returns top n per group
# window functions
# lag - offset vector by 1 e.g. v <- c(1,3,6,14); so - lag(v) = NA 1 3 6
new_df %>% arrange(id,age) %>% mutate(age_diff = ifelse(id==lag(id),age - lag(age),1))
Missing data
By default, the regression model will exclude any observation with missing values on its predictors. Missing values can be treated as a separate category for categorical data. For missing numeric data, imputation is a potential solution.
# find the average age among non-missing values
summary(df$age)
# impute missing age values with the mean age
df$imputed_age <- ifelse(is.na(df$age)==TRUE,round(mean(df$age, na.rm=TRUE),2),df$age)
# create missing value indicator for age
df$missing_age <- ifelse(is.na(df$age)==TRUE,1,0)
Review exposure data
The tables and graphs that follow look at:
- the mix of business over rating factors, using some of the selection methods described; these are all consistent with the simulation specification.
- the correlation of ordered numerical rating factors: age and sum assured as well as age and policy year are positively correlated.
# look at distribution by single rating factors
df %>% count(benefit_period, wt = NULL, sort = TRUE) %>% mutate(freq = percent(round(n / sum(n),2))) %>% format(n, big.mark=",") %>% kbl(caption = "Benefit period mix")
| benefit_period | n | freq |
|---|---|---|
| a65 | 456,552 | 76% |
| 2yr | 72,159 | 12% |
| 5yr | 71,289 | 12% |
df %>% count(waiting_period, wt = NULL, sort = TRUE) %>% mutate(freq = percent(round(n / sum(n),2))) %>% format(n, big.mark=",") %>% kbl(caption = "Waiting period mix")
| waiting_period | n | freq |
|---|---|---|
| 14d | 420,009 | 70.0% |
| 90d | 90,717 | 15.0% |
| 720d | 65,814 | 11.0% |
| 30d | 23,460 | 4.0% |
df %>% count(occupation, wt = NULL, sort = TRUE) %>% mutate(freq = percent(round(n / sum(n),2))) %>% format(n, big.mark=",") %>% kbl(caption = "Occupation mix")
| occupation | n | freq |
|---|---|---|
| sed | 240,348 | 40% |
| techn | 120,717 | 20% |
| white | 119,847 | 20% |
| blue | 59,613 | 10% |
| prof | 59,475 | 10% |
# consider a histogram to show the distribution of numeric data
hist(df$age, main = "Histogram of age", xlab = "Age", ylab = "Frequency")
hist(df$sum_assured, main = "Histogram of sum assured", xlab = "Sum assured", ylab = "Frequency")
hist(df$policy_year, main = "Histogram of policy year", xlab = "Policy year", ylab = "Frequency")
# correlation of ordered numeric explanatory variables
# pairs() gives correlation matrix and plots; test on a random sample from our data
df_sample <- sample_n(df,10000,replace=FALSE)
df_sample %>% select(age,policy_year,sum_assured) %>% pairs
# or cor() to return just the correlation matrix
cor <- df_sample %>% select(age,policy_year,sum_assured) %>% cor
cor
## age policy_year sum_assured
## age 1.0000000 0.4333434 0.2937906
## policy_year 0.4333434 1.0000000 0.1249400
## sum_assured 0.2937906 0.1249400 1.0000000
# corrplot() is an alternative to visualise a correlation matrix
corrplot(cor)
# ggpairs() similarly shows correlations for ordered numeric data as well as other summary stats
df_sample %>% select(age,policy_year,sum_assured,sex, smoker) %>% ggpairs
# summary statistics for subsets of data
head(aggregate(df$sum_assured~df$age,data=df,mean))
## df$age df$sum_assured
## 1 25 3584.319
## 2 26 4431.558
## 3 27 4735.551
## 4 28 5066.777
## 5 29 5068.259
## 6 30 5204.272
Data format
There are two main formats for structured data: long and wide. For regression, the structure of the data informs the model structure. For count data:
- the long format corresponds to a Bernoulli response (claim or no claim for each observation) and allows predictor variables to vary by observation;
- the wide format corresponds to a Binomial response (count of claims per exposure). Wide structures can use a matrix of successes and failures, or a proportion of successes with corresponding weights (number of observations/exposure) for each row.
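The Bernoulli/Binomial equivalence above can be sketched with a toy example (data invented here, not the simulated portfolio): a logistic fit on long 0/1 data and a fit on the grouped success/failure counts recover the same coefficients.

```r
# toy long-format data: one binary claim outcome per row (hypothetical)
set.seed(1)
long <- data.frame(sex = rep(c("m", "f"), each = 100))
long$claim <- rbinom(200, 1, ifelse(long$sex == "m", 0.30, 0.15))

# Bernoulli model: one 0/1 response per observation
fit_long <- glm(claim ~ sex, family = binomial, data = long)

# Binomial model: one row per group, with a success/failure count matrix
agg <- aggregate(claim ~ sex, data = long,
                 FUN = function(z) c(yes = sum(z), no = sum(1 - z)))
wide <- data.frame(sex = agg$sex, yes = agg$claim[, "yes"], no = agg$claim[, "no"])
fit_wide <- glm(cbind(yes, no) ~ sex, family = binomial, data = wide)

# the two fits give the same coefficient estimates
all.equal(coef(fit_long), coef(fit_wide), tolerance = 1e-6)
```

The cbind(successes, failures) response is the matrix form mentioned above; a proportion response with a weights argument is the other wide-format option.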
There are several tidyverse functions that can help with restructuring data, for example:
# convert data into wide format, e.g. a separate column for each value of sex:
head(spread(df_sample, sex, inc_count_tot, fill = 0)) %>% kbl(caption = "Wide data format example")
| id | cal_year | policy_year | smoker | benefit_period | waiting_period | occupation | age | sum_assured | inc_sick_expected | inc_acc_expected | inc_count_sick | inc_count_acc | m | f | u |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 186530 | 2019 | 6 | n | a65 | 14d | sed | 45 | 7850 | 0.0044014 | 0.0006775 | 0 | 0 | 0 | 0 | 0 |
| 89986 | 2020 | 4 | n | a65 | 14d | techn | 48 | 8359 | 0.0053024 | 0.0008042 | 0 | 0 | 0 | 0 | 0 |
| 12688 | 2019 | 6 | n | 5yr | 14d | white | 48 | 9391 | 0.0058327 | 0.0036187 | 0 | 0 | 0 | 0 | 0 |
| 16974 | 2019 | 6 | s | a65 | 14d | white | 49 | 2687 | 0.0083455 | 0.0055529 | 0 | 0 | 0 | 0 | 0 |
| 70839 | 2020 | 7 | n | a65 | 14d | prof | 41 | 3597 | 0.0024758 | 0.0024551 | 0 | 0 | 0 | 0 | 0 |
| 120872 | 2018 | 4 | n | a65 | 14d | prof | 46 | 5177 | 0.0040212 | 0.0032278 | 0 | 0 | 0 | 0 | 0 |
# gather() converts data into long format
# also pivot_longer() and pivot_wider()
Visualisation methods
Introduction to ggplot
This section sets out some simple visualisation methods using ggplot(). ggplot() initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden (pkgdown, n.d.a). The form of ggplot is:
ggplot(data = df, mapping = aes(x,y, other aesthetics), …)
Examples below use ggplot to explore the exposure data.
# data argument passes the data frame to be visualised
# mapping argument defines a list of aesthetics for the visualisation - all subsequent layers use those unless overridden
# typically, the dependent variable is mapped onto the y-axis and the independent variable onto the x-axis.
ggplot(data=df_sample, mapping=aes(x=age, y=sum_assured)) + # the '+' adds the layer below
# add subsequent visualisation layers, e.g. geom_point() for scatterplot
geom_point() +
# add a layer to change axis labels
labs(x="Age", y="Sum assured", title = "Sum Assured by age")
# could add a layer to specify axis limits with ylim() and xlim()
Layers
The aesthetics input has a number of different options, for example x and y (axes), colour, size, fill, labels, alpha (transparency), shape, line type/ width. You can change the aesthetics of each layer or default to the base layer. You can change the general look and feel of charts with a themes layer e.g. colour palette (see more in the next section).
You can add more layers to the base plot, for example
- Geometries (geom_), for example
- point - scatterplot,
- line,
- histogram,
- bar/ column,
- boxplot,
- density,
- jitter - adds random noise to separate points,
- count - counts the number of observations at each location, then maps the count to point area,
- abline - adds a reference line - vertical, horizontal or diagonal,
- curve - adds a curved line to the chart between specified points,
- text - add a text layer to label data points.
- Statistics (stat_)
- smooth (curve fitted to the data),
- bin (e.g. for histogram).
A note on overlapping points: these can be adjusted for by adding noise and transparency to your points:
- within an existing geom, e.g. geom_point(position="...") with options including: identity (default; position as per the data), dodge (dodge overlapping objects side-to-side), jitter (random noise), jitterdodge, and nudge (nudge points a fixed distance), e.g. geom_bar(position="dodge") or geom_bar(position=position_dodge(width=0.2)).
- or use geom_* with arguments e.g. geom_jitter(alpha = 0.2, shape=1). Shape choice might help, shape = 1 gives hollow circles.
Or alternatively count overlapping points with geom_count().
A full list of layers is available here.
ggplot(data=df_sample, aes(x=age, y=sum_assured)) +
geom_point() +
# separate overlapping points
geom_jitter(alpha = 0.2, width = 0.2) +
# add a smoothing line
geom_smooth(method = "glm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
Themes
You can add a themes layer to your graph (pkgdown, n.d.b), for example
- theme_gray() - gray background and white grid lines
- theme_bw() - white background and gray grid lines
- theme_linedraw() - only black lines of various widths on white backgrounds
- theme_light() - similar to theme_linedraw() but with light grey lines and axes, to direct more attention towards the data
- theme_dark() - similar to theme_light() but with a dark background
- others, e.g. theme_minimal() and theme_classic()
Other packages like ggthemes carry many more options. An example of an added themes layer is below. See also these examples from ggthemes.
# add an occupation group to the data
df_sample <- df_sample %>% mutate(occ_group = factor(case_when(occupation %in% c("white","prof","sed") ~ "WC", TRUE ~ "BC")))
# vary colour by occupation
ggplot(data=df_sample, aes(x=age, y=sum_assured, color=occ_group)) +
# jitter and fit a smoothed line as below
geom_jitter(alpha = 0.2, width = 0.2) +
geom_smooth(method = "glm", se=FALSE) +
# add labels
labs(x="Age", y="Sum assured", title = "Sum Assured by age") +
# adding theme and colour palette layers
theme_pander() + scale_color_gdocs()
## `geom_smooth()` using formula 'y ~ x'
3D visualisations
ggplot does not cater to 3D visualisations, but this can be done simply with plotly.
plot_base <- plot_ly(data=df_sample, z= ~sum_assured, x= ~age, y=~policy_year, opacity=0.6) %>%
add_markers(color = ~occ_group,colors= c("blue", "red"), marker=list(size=2))
# show graph
plot_base
## Warning: `arrange_()` was deprecated in dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
We can add a modeled outcome to the 3D chart. For detail on the model fit, see those sections.
# to add a plane we need to define the points on the plane. To do that, we first create a grid of x and y values, where x and y are defined as earlier.
x_grid <- seq(from = min(df_sample$age), to = max(df_sample$age), length = 50)
y_grid <- seq(from = min(df_sample$policy_year), to = max(df_sample$policy_year), length = 50)
# create a simple model and extract the coefficient estimates
coeff_est <- glm(sum_assured ~ age + policy_year + occ_group,family="gaussian",data=df_sample) %>% coef()
# extract fitted values for z - here we want fitted values for BC and WC separately; use levels() to determine how the model orders the factor occ_group
fitted_values_BC <- crossing(y_grid, x_grid) %>% mutate(z_grid = coeff_est[1] + coeff_est[2]*x_grid + coeff_est[3]*y_grid)
fitted_values_WC <- crossing(y_grid, x_grid) %>% mutate(z_grid = coeff_est[1] + coeff_est[2]*x_grid + coeff_est[3]*y_grid + coeff_est[4])
# convert to matrix
z_grid_BC <- fitted_values_BC %>% pull(z_grid) %>% matrix(nrow = length(x_grid)) %>% t()
z_grid_WC <- fitted_values_WC %>% pull(z_grid) %>% matrix(nrow = length(x_grid)) %>% t()
# define solid colours for the two planes/ surfaces
colorscale_BC = list(c(0, 1), c("red", "red"))
colorscale_WC = list(c(0, 1), c("blue", "blue"))
# use plot base created above, add a surface for BC sum assureds and WC sum assureds
plot_base %>%
add_surface(x = x_grid, y = y_grid, z = z_grid_BC, showscale=FALSE, colorscale=colorscale_BC) %>%
add_surface(x = x_grid, y = y_grid, z = z_grid_WC, showscale=FALSE, colorscale=colorscale_WC) %>%
# filtering sum assured on a narrower range
layout(scene = list(zaxis = list(range=c(4000,12000))))
Review claim data
# records, claim vs no claim. There should be close to nil overlapping claims. The actual claim rate is ~0.003-0.005
df %>% select(inc_count_acc,inc_count_sick) %>% table()
## inc_count_sick
## inc_count_acc 0 1
## 0 596750 2087
## 1 1163 0
# use ggplot to plot inc_count_sick by age and sex; using df_sample from earlier
# clearly all of the points are going to be at 0 or 1 and will overlap at each age -> not useful.
df_sample %>% ggplot(aes(x=age,y=inc_count_sick,color=sex)) +
geom_point()
# as above but add some random noise around the points to separate them
df_sample %>% ggplot(aes(x=age,y=inc_count_sick,color=sex)) +
geom_point(position=position_jitter(height=0.1))
# as above but excluding unknown sex (as there are very few claims observed for that group) and adding a smoothing line (setting method as glm)
# because the claim rate is so low, the smoothed line is very close to zero and so not a particularly useful visualisation.
df_sample %>% filter(sex != "u") %>% ggplot(aes(x=age,y=inc_count_sick,color=sex)) +
geom_point(position=position_jitter(height=0.1)) +
geom_smooth(method="glm", method.args = list(family = "binomial")) # or list(family = binomial(link='logit'))
## `geom_smooth()` using formula 'y ~ x'
# looking at total count of claim rather than just sickness shows a slight trend by age
df_sample %>% filter(sex != "u") %>% ggplot(aes(x=age,y=inc_count_tot,color=sex)) +
geom_point(position=position_jitter(height=0.1)) +
geom_smooth(method="glm", method.args = list(family = "binomial")) # or list(family = binomial(link='logit'))
## `geom_smooth()` using formula 'y ~ x'
# given the actual count of claims is so low, it might be more useful to consider the claim rate
# use the manipulation methods from earlier to get claim rates by age and sex for accident and sickness; filter out unknown sex and age with low exposure
# this shows a clear trend by age for males and females
df %>% filter(sex != "u", between(age, 30,60)) %>% group_by(age,sex) %>% summarise(total_sick=sum(inc_count_sick),total_acc=sum(inc_count_acc), exposure=n()) %>%
mutate(sick_rate = total_sick/exposure, acc_rate = total_acc/exposure) %>%
# use ggplot to graph the results
ggplot(aes(x=age,y=sick_rate,color=sex)) +
geom_point() +
geom_line() +
# add a smoothing line
geom_smooth(method = 'glm',se=FALSE)
## `summarise()` has grouped output by 'age'. You can override using the `.groups` argument.
## `geom_smooth()` using formula 'y ~ x'
Model selection
The sections below provide a refresher on linear and logistic regression; some considerations for insurance data; model selection and testing model fit.
Models for grouped vs ungrouped data
Splitting data
Split data into training and testing data sets.
# Determine the number of rows for training
nrow(df)
## [1] 600000
# Create a random sample of row IDs
sample_rows <- sample(nrow(df),0.75*nrow(df))
# Create the training dataset
df_train <- df[sample_rows,]
# Create the test dataset
df_test <- df[-sample_rows,]
Regression
Background
Linear regression is a method of modelling the relationship between a response (dependent variable) and one or more explanatory variables (predictors; independent variables). For a data set \(\{(y_{i}, x_{i1}, \dots, x_{ip})\}_{i=1}^{n}\), the relationship between \(y\), the response (dependent) variable, and the vector of \(x\)'s, the explanatory variables, is linear, of the form
\(y_{i} = \beta_{0} + \beta_{1}x_{i1} + ... + \beta_{p}x_{ip} + \epsilon_{i} = \mathbf{x}_{i}^t\mathbf{\beta} + \epsilon_{i}\), \(i = 1,...,n\)
Key assumptions:
- linearity - the response is a linear combination of the regression coefficients and the explanatory variables
- constant variance (homoskedasticity) - the variance of the errors does not depend upon the explanatory variables
- independent errors - the response variables are uncorrelated
- no perfect collinearity between explanatory variables
- weak exogeneity - predictors are treated as fixed, with no measurement error.
Before fitting, confirm that relationships are monotonic where assumed, and check that the data are not over-dispersed: too many zeros or ones (Binomial); too many zeros or too large a variance (Poisson); variance that changes with the mean. The sections below cover the choice of response distribution, link (and offsets), and explanatory variables.
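One quick over-dispersion check can be sketched as follows (toy data; the negative binomial draw is only there to manufacture over-dispersion): compare the residual deviance to its degrees of freedom - a ratio well above 1 suggests the Poisson assumption is too tight.

```r
# simulate over-dispersed counts (variance > mean) and fit a Poisson GLM
set.seed(2)
x <- runif(500)
y <- rnbinom(500, mu = exp(0.5 + x), size = 1)  # negative binomial: var = mu + mu^2
fit <- glm(y ~ x, family = poisson)

# dispersion ratio: residual deviance / residual degrees of freedom
ratio <- deviance(fit) / df.residual(fit)
ratio  # well above 1 here; consider quasipoisson or MASS::glm.nb
```

For binary data, an analogous symptom is far more (or fewer) ones than the fitted probabilities imply.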
Linear vs logistic regression
Logistic regression: with a binary response outcome, a straight line does not fit the data well. Fitting a logistic function instead keeps the predicted values between 0 and 1. The fit is hard to transform to a linear chart - probabilities are not linear in the predictors, but the log odds are. Transform the coefficients to odds ratios by taking the exponential, exp(coef(model)); these show the relative change to the odds of the response:
- odds ratio = 1: the coefficient has no effect;
- odds ratio < 1: the coefficient predicts a decreased chance of the event occurring;
- odds ratio > 1: the coefficient predicts an increased chance of the event occurring.
prob <- predict(model, test_data, type="response") # set type to "response" to return probabilities rather than log odds
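The odds-ratio reading can be sketched with a toy logistic fit (data and the 0.08 slope are invented for illustration, not from the simulated portfolio):

```r
# toy logistic fit: claim probability rising with age
set.seed(3)
d <- data.frame(age = sample(25:60, 2000, replace = TRUE))
d$claim <- rbinom(2000, 1, plogis(-6 + 0.08 * d$age))
m <- glm(claim ~ age, family = binomial, data = d)

# exponentiate coefficients to get odds ratios:
# each extra year of age multiplies the odds of a claim by roughly exp(0.08)
odds_ratios <- exp(coef(m))
odds_ratios["age"] > 1  # > 1: increased chance of a claim per year of age
```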
- Slopes: estimate a linear coefficient for a continuous variable; requires a global intercept.
- Intercepts: discrete groups used to predict; a reference intercept plus contrasts (y~x), or an intercept for each group (y~x-1).
- Link: the relationship of the response variable to the mean.
- Offset: an adjustment for exposure.
- model.matrix() codes groups into indicator variables, e.g. model.matrix(~x1+x2).
- lm(y~ . -x1) fits a model using all variables excluding x1.
- glm(y~1) fits a null model, i.e. no predictors.
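The offset idea can be sketched with a toy Poisson rate model (data and parameter values are invented): including offset(log(exposure)) fixes its coefficient at 1, so the model fits claim rates per unit of exposure rather than raw counts.

```r
# toy data: claim counts proportional to policy exposure
set.seed(5)
d <- data.frame(exposure = runif(2000, 0.5, 3), x = rnorm(2000))
d$claims <- rpois(2000, lambda = d$exposure * exp(-2 + 0.5 * d$x))

# offset(log(exposure)) enters with coefficient fixed at 1,
# so exp(coef) is interpreted on the rate scale
m <- glm(claims ~ x + offset(log(exposure)), family = poisson, data = d)
coef(m)["x"]  # recovers a value near the simulated 0.5
```

An equivalent form is the offset argument to glm(), e.g. glm(claims ~ x, offset = log(exposure), ...).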
Claim incidence - logistic regression
Interpreting coefficients; adding predicted values to the data.
model_1 <- glm(inc_count_sick~age,data=df_train,family="binomial") # use the default link
model_2 <- glm(inc_count_sick~age,data=df_train,family=binomial(link="logit")) # specify the link, logit is default for binomial
summary(model_2)##
## Call:
## glm(formula = inc_count_sick ~ age, family = binomial(link = "logit"),
## data = df_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.1719 -0.0925 -0.0764 -0.0662 3.8636
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.143801 0.221246 -45.85 <2e-16 ***
## age 0.095743 0.004542 21.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20946 on 449999 degrees of freedom
## Residual deviance: 20499 on 449998 degrees of freedom
## AIC: 20503
##
## Number of Fisher Scoring iterations: 9
Adding more explanatory variables
model_3 <- glm(inc_count_sick~age+sex+waiting_period,data=df_train,family=binomial(link="logit"))
summary(model_3)
##
## Call:
## glm(formula = inc_count_sick ~ age + sex + waiting_period, family = binomial(link = "logit"),
## data = df_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.2268 -0.0996 -0.0748 -0.0521 4.0388
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.560415 0.243023 -39.340 < 2e-16 ***
## age 0.095365 0.004538 21.016 < 2e-16 ***
## sexf -0.723910 0.080513 -8.991 < 2e-16 ***
## sexu -0.392738 0.133308 -2.946 0.00322 **
## waiting_period14d -0.225911 0.107044 -2.110 0.03482 *
## waiting_period720d -1.831853 0.184362 -9.936 < 2e-16 ***
## waiting_period90d -1.590750 0.155150 -10.253 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20946 on 449999 degrees of freedom
## Residual deviance: 20033 on 449993 degrees of freedom
## AIC: 20047
##
## Number of Fisher Scoring iterations: 9
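To interpret the fitted coefficients above on the odds scale, exponentiate the values printed in the summary:

```r
# Odds ratios from model_3's printed coefficients
exp(0.095365)   # age: ~1.10, i.e. ~10% higher odds of a sickness claim per year
exp(-0.723910)  # sexf: ~0.48, i.e. females have roughly half the odds of the reference level
```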
But how do you know whether a model is a good fit, or whether you have captured the right number and combination of variables? Validation techniques and model comparisons are covered in a later section. When deciding which explanatory variables to model, consider the properties of the data (such as correlation or collinearity).
Actuals or AvE?
Stepwise regression
This method builds a regression model from a set of candidate predictor variables by entering (or removing) predictors based on p-values, in a stepwise manner, until no candidate variables remain. The method may over- or understate the importance of predictors, and the direction of stepping (forward or backward) can affect the outcome, so some degree of interpretation is necessary.
# specify a null model with no predictors
null_model_sick <- glm(inc_count_sick ~ 1, data = df_train, family = "binomial")
# specify the full model using all of the potential predictors
full_model_sick <- glm(inc_count_sick ~ cal_year + policy_year + sex + smoker + benefit_period + waiting_period + occupation + age + sum_assured, data = df_train, family = "binomial")
# use a forward stepwise algorithm to build a parsimonious model
step_model_sick <- step(null_model_sick, scope = list(lower = null_model_sick, upper = full_model_sick), direction = "forward")
## Start: AIC=20948.4
## inc_count_sick ~ 1
##
## Df Deviance AIC
## + age 1 20499 20503
## + waiting_period 3 20581 20589
## + policy_year 1 20833 20837
## + sum_assured 1 20842 20846
## + sex 2 20844 20850
## + smoker 2 20925 20931
## + cal_year 1 20932 20936
## <none> 20946 20948
## + benefit_period 2 20945 20951
## + occupation 4 20941 20951
##
## Step: AIC=20503.31
## inc_count_sick ~ age
##
## Df Deviance AIC
## + waiting_period 3 20135 20145
## + sex 2 20398 20406
## + smoker 2 20478 20486
## + sum_assured 1 20482 20488
## <none> 20499 20503
## + policy_year 1 20498 20504
## + cal_year 1 20499 20505
## + occupation 4 20494 20506
## + benefit_period 2 20498 20506
##
## Step: AIC=20145.08
## inc_count_sick ~ age + waiting_period
##
## Df Deviance AIC
## + sex 2 20033 20047
## + smoker 2 20114 20128
## + sum_assured 1 20118 20130
## <none> 20135 20145
## + policy_year 1 20134 20146
## + cal_year 1 20135 20147
## + occupation 4 20129 20147
## + benefit_period 2 20134 20148
##
## Step: AIC=20047.43
## inc_count_sick ~ age + waiting_period + sex
##
## Df Deviance AIC
## + smoker 2 20012 20030
## + sum_assured 1 20017 20033
## <none> 20033 20047
## + policy_year 1 20032 20048
## + cal_year 1 20033 20049
## + occupation 4 20028 20050
## + benefit_period 2 20032 20050
##
## Step: AIC=20030.01
## inc_count_sick ~ age + waiting_period + sex + smoker
##
## Df Deviance AIC
## + sum_assured 1 19995 20015
## <none> 20012 20030
## + policy_year 1 20011 20031
## + cal_year 1 20012 20032
## + occupation 4 20006 20032
## + benefit_period 2 20011 20033
##
## Step: AIC=20015.39
## inc_count_sick ~ age + waiting_period + sex + smoker + sum_assured
##
## Df Deviance AIC
## <none> 19995 20015
## + policy_year 1 19994 20016
## + cal_year 1 19995 20017
## + benefit_period 2 19994 20018
## + occupation 4 19995 20023
summary(full_model_sick)
##
## Call:
## glm(formula = inc_count_sick ~ cal_year + policy_year + sex +
## smoker + benefit_period + waiting_period + occupation + age +
## sum_assured, family = "binomial", data = df_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.2833 -0.0973 -0.0744 -0.0519 4.0320
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.714e+01 6.605e+01 -0.714 0.475438
## cal_year 1.863e-02 3.272e-02 0.569 0.569162
## policy_year 1.043e-02 1.127e-02 0.926 0.354571
## sexf -7.239e-01 8.052e-02 -8.991 < 2e-16 ***
## sexu -3.903e-01 1.333e-01 -2.927 0.003419 **
## smokers 3.465e-01 7.424e-02 4.667 3.06e-06 ***
## smokeru -1.074e-01 1.239e-01 -0.867 0.385941
## benefit_period2yr -9.194e-02 8.108e-02 -1.134 0.256790
## benefit_period5yr -3.869e-02 7.953e-02 -0.487 0.626584
## waiting_period14d -2.289e-01 1.071e-01 -2.138 0.032537 *
## waiting_period720d -1.834e+00 1.844e-01 -9.949 < 2e-16 ***
## waiting_period90d -1.594e+00 1.552e-01 -10.270 < 2e-16 ***
## occupationsed -2.231e-02 9.764e-02 -0.229 0.819248
## occupationtechn -4.968e-02 1.022e-01 -0.486 0.626775
## occupationwhite -6.086e-02 1.013e-01 -0.601 0.548076
## occupationblue -3.212e-02 1.189e-01 -0.270 0.786998
## age 8.702e-02 5.383e-03 16.167 < 2e-16 ***
## sum_assured 4.214e-05 1.243e-05 3.391 0.000696 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20946 on 449999 degrees of freedom
## Residual deviance: 19992 on 449982 degrees of freedom
## AIC: 20028
##
## Number of Fisher Scoring iterations: 9
summary(step_model_sick)
##
## Call:
## glm(formula = inc_count_sick ~ age + waiting_period + sex + smoker +
## sum_assured, family = "binomial", data = df_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.2830 -0.0973 -0.0745 -0.0519 4.0246
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.640e+00 2.432e-01 -39.635 < 2e-16 ***
## age 8.953e-02 4.755e-03 18.829 < 2e-16 ***
## waiting_period14d -2.295e-01 1.071e-01 -2.144 0.03202 *
## waiting_period720d -1.835e+00 1.844e-01 -9.951 < 2e-16 ***
## waiting_period90d -1.594e+00 1.552e-01 -10.273 < 2e-16 ***
## sexf -7.238e-01 8.052e-02 -8.989 < 2e-16 ***
## sexu -3.916e-01 1.333e-01 -2.937 0.00331 **
## smokers 3.466e-01 7.424e-02 4.669 3.03e-06 ***
## smokeru -1.067e-01 1.239e-01 -0.862 0.38885
## sum_assured 4.333e-05 1.064e-05 4.072 4.66e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20946 on 449999 degrees of freedom
## Residual deviance: 19995 on 449990 degrees of freedom
## AIC: 20015
##
## Number of Fisher Scoring iterations: 9
Predictions
Predicted values from a fitted GLM can be generated with predict(); setting type = "response" returns probabilities rather than log odds, and the results can be appended to the data for plotting or A/E comparisons.
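A minimal sketch on simulated data (the model and the df_test holdout here are stand-ins, not the recipe book's objects):

```r
# Fit on simulated data, then score a small holdout set
set.seed(2)
df <- data.frame(age = runif(5000, 20, 60))
df$y <- rbinom(5000, 1, plogis(-4 + 0.05 * df$age))
m <- glm(y ~ age, family = binomial, data = df)

df_test <- data.frame(age = c(25, 40, 55))
df_test$pred_prob <- predict(m, newdata = df_test, type = "response")
df_test  # predicted claim probabilities rise with age
```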
Evaluation
Techniques
\(R^{2}\) statistic (coefficient of determination) measures the proportion of variance in the dependent variable that can be explained by the independent variables. Adjusted \(R^{2}\), adjusts for the number of predictors in the model. The adjusted R-squared increases when the new predictor improves the model more than would be expected by chance.
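R-squared does not carry over directly to logistic models; a common analogue is McFadden's pseudo-R-squared, one minus the ratio of residual to null deviance. Using the deviances printed for model_3 above:

```r
# McFadden's pseudo-R-squared for model_3, from the summary output
pseudo_r2 <- 1 - 20033 / 20946
round(pseudo_r2, 3)  # ~0.044: the predictors explain a small share of the deviance
```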
library(MASS) # loaded here for stepAIC; masks select()
##
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
##
## select
## The following object is masked from 'package:dplyr':
##
## select
summary(model_3)
##
## Call:
## glm(formula = inc_count_sick ~ age + sex + waiting_period, family = binomial(link = "logit"),
## data = df_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.2268 -0.0996 -0.0748 -0.0521 4.0388
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.560415 0.243023 -39.340 < 2e-16 ***
## age 0.095365 0.004538 21.016 < 2e-16 ***
## sexf -0.723910 0.080513 -8.991 < 2e-16 ***
## sexu -0.392738 0.133308 -2.946 0.00322 **
## waiting_period14d -0.225911 0.107044 -2.110 0.03482 *
## waiting_period720d -1.831853 0.184362 -9.936 < 2e-16 ***
## waiting_period90d -1.590750 0.155150 -10.253 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20946 on 449999 degrees of freedom
## Residual deviance: 20033 on 449993 degrees of freedom
## AIC: 20047
##
## Number of Fisher Scoring iterations: 9
AIC(model_1)
## [1] 20503.31
AIC(model_3)
## [1] 20047.43
stepAIC(step_model_sick)
## Start: AIC=20015.39
## inc_count_sick ~ age + waiting_period + sex + smoker + sum_assured
##
## Df Deviance AIC
## <none> 19995 20015
## - sum_assured 1 20012 20030
## - smoker 2 20017 20033
## - sex 2 20097 20113
## - age 1 20352 20370
## - waiting_period 3 20360 20374
##
## Call: glm(formula = inc_count_sick ~ age + waiting_period + sex + smoker +
## sum_assured, family = "binomial", data = df_train)
##
## Coefficients:
## (Intercept) age waiting_period14d waiting_period720d
## -9.640e+00 8.953e-02 -2.295e-01 -1.835e+00
## waiting_period90d sexf sexu smokers
## -1.594e+00 -7.238e-01 -3.916e-01 3.466e-01
## smokeru sum_assured
## -1.067e-01 4.333e-05
##
## Degrees of Freedom: 449999 Total (i.e. Null); 449990 Residual
## Null Deviance: 20950
## Residual Deviance: 20000 AIC: 20020
anova(model_2, step_model_sick)
## Analysis of Deviance Table
##
## Model 1: inc_count_sick ~ age
## Model 2: inc_count_sick ~ age + waiting_period + sex + smoker + sum_assured
## Resid. Df Resid. Dev Df Deviance
## 1 449998 20499
## 2 449990 19995 8 503.92
So when comparing models, the lower the AIC the better; for nested models compared with anova(), a large drop in residual deviance relative to the degrees of freedom used indicates that the additional terms improve the fit.
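The deviance drop in the anova table above can be tested against a chi-squared distribution (a likelihood-ratio test); anova() will also do this directly with test = "Chisq":

```r
# Likelihood-ratio test: deviance falls by 503.92 on 8 extra parameters
pchisq(503.92, df = 8, lower.tail = FALSE)
# p-value is effectively zero, so the larger model fits significantly better;
# equivalently: anova(model_2, step_model_sick, test = "Chisq")
```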
Residual plots
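With a binary response, raw residual plots show characteristic bands rather than a random scatter; a sketch on simulated data:

```r
# Deviance residuals against fitted probabilities for a simulated logistic fit;
# the two bands correspond to the 0/1 outcomes and are expected
set.seed(5)
df <- data.frame(age = runif(10000, 20, 60))
df$y <- rbinom(10000, 1, plogis(-4 + 0.05 * df$age))
m <- glm(y ~ age, family = binomial, data = df)
plot(fitted(m), residuals(m, type = "deviance"),
     xlab = "fitted probability", ylab = "deviance residual")
```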
Confusion matrix
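A sketch on simulated data; because claims are rare, a 0.5 cutoff would predict almost no claims, so a lower illustrative cutoff is used (the 0.10 threshold is an assumption for the example, not a recommendation):

```r
# Cross-tabulate predicted class against actual outcome
set.seed(3)
df <- data.frame(age = runif(20000, 20, 60))
df$y <- rbinom(20000, 1, plogis(-4 + 0.05 * df$age))
m <- glm(y ~ age, family = binomial, data = df)

prob <- predict(m, type = "response")
pred_class <- as.integer(prob > 0.10)
table(predicted = pred_class, actual = df$y)
```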
ROC/AUC
The ROC (Receiver Operating Characteristic) curve plots the true positive rate against the false positive rate across probability cutoffs; the area under the curve (AUC) summarises how well the model discriminates claims from non-claims.
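Using the pROC package loaded earlier, a sketch on simulated data:

```r
library(pROC)

# Simulate, fit, and score
set.seed(4)
df <- data.frame(age = runif(10000, 20, 60))
df$y <- rbinom(10000, 1, plogis(-4 + 0.05 * df$age))
m <- glm(y ~ age, family = binomial, data = df)
prob <- predict(m, type = "response")

roc_obj <- roc(response = df$y, predictor = prob)
auc(roc_obj)   # 0.5 = no discrimination, 1 = perfect
plot(roc_obj)  # ROC curve: sensitivity against specificity
```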
Out of sample predictions
References
Bibliography
Further reading
- R Markdown Cookbook has everything you need to know to set up an R Markdown document like this one.
- Generalised Linear Models for Insurance Data, book
- Tidyverse documentation: the full set of documentation for tidyverse packages (packages for data science), e.g. dplyr for data manipulation, tidyr for tidying up messy data, and ggplot2 for visualisation.